MAT245 Lab 10

Multi-Layer Perceptrons

Structure & Flow of Information

Multi-layer perceptrons (MLPs) are a simple class of neural networks. It's easiest to understand how an MLP works by examining one directly. Below is a diagram of a basic MLP.

There are four input neurons, three hidden neurons, and two output neurons. We can label these as follows:

  • input neurons: $i_1, \dots, i_4$
  • hidden neurons: $h_1, h_2, h_3$
  • output neurons: $o_1, o_2$.

Each line between two neurons represents a connection and has an associated weight. For instance, $i_1$ connects into $h_2$; the weight of the connection between these two is denoted $w_{i_1, h_2}$.

A network like the above gives a mapping $\mathbb{R}^4 \to \mathbb{R}^2$. Let's see how a sample input vector $(x_1, x_2, x_3, x_4)$ flows through the network.

  1. Assign each entry in the input vector to the corresponding input neuron, i.e. set $i_1 = x_1, \dots, i_4 = x_4$.
  2. Compute the activations of the hidden neurons via the formula \begin{align*} h_i &= f(w_{i_1,h_i} i_1 + \dots + w_{i_4,h_i} i_4) \\ &= f(\textbf{w}^{h_i} \cdot \textbf{i}) \end{align*} where $f$ is the network's activation function. The most common activation functions are the relu, sigmoid, and tanh functions (pictured below). Notice that we simply compute the dot product of the incoming weights with the inputs and plug the result into the activation function $f$.
  3. Compute the activations of the output neurons via a similar formula: \begin{align*} o_i &= f(w_{h_1,o_i} h_1 + w_{h_2,o_i} h_2 + w_{h_3,o_i} h_3) \\ &= f(\textbf{w}^{o_i} \cdot \textbf{h}). \end{align*}

So the neural network implements the map $(\textbf{x}, \textbf{w}) \mapsto F_{\mathrm{net}}(\textbf{x}, \textbf{w}) = (o_1, o_2)$ where the outputs $(o_1, o_2)$ are obtained from the process above.
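
To make the flow concrete, here is a minimal NumPy sketch of this forward pass for the 4-3-2 network above. The weight matrices are random placeholders rather than trained values, and the sigmoid stands in for the activation $f$.

In [ ]:
import numpy as np
from scipy.special import expit  # the sigmoid function

rng = np.random.default_rng(0)

# Illustrative (untrained) weights: row k holds the weights feeding into the
# k-th neuron of the next layer.
W_hidden = rng.normal(size=(3, 4))   # w_{i_j, h_k}: 4 inputs -> 3 hidden neurons
W_output = rng.normal(size=(2, 3))   # w_{h_k, o_m}: 3 hidden neurons -> 2 outputs

x = np.array([5.1, 3.5, 1.4, 0.2])   # an example input vector (x_1, ..., x_4)

h = expit(W_hidden @ x)   # hidden activations: h_k = f(w^{h_k} . i)
o = expit(W_output @ h)   # output activations: o_m = f(w^{o_m} . h)
print(o)                  # F_net(x, w), a vector in R^2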

Graphs of common activations


In [1]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import expit
from sklearn import datasets, mixture

In [2]:
xs = np.linspace(-5, 5)
fig = plt.figure(figsize=(20, 5))

## Plot relu
ax1 = fig.add_subplot(1, 3, 1)
ax1.plot(xs, np.maximum(0, xs)) 

## Plot sigmoid
ax2 = fig.add_subplot(1, 3, 2)
ax2.plot(xs, expit(xs))

## Plot tanh
ax3 = fig.add_subplot(1, 3, 3)
ax3.plot(xs, np.tanh(xs))

plt.show()


Training a multi-layer perceptron

Suppose $\textbf{x}_1, \dots, \textbf{x}_n$ are samples of data. Each data point $\textbf{x}_i$ is associated with a target value $\textbf{y}_i$, and we want to train a neural network to approximate the mapping $\textbf{x}_i \mapsto \textbf{y}_i$. We do so by choosing the weights $\textbf{w}$ that minimize some error function $E$. For example, we might take $E$ to be the mean squared error,

$$ E(\textbf{w}) = \frac{1}{n} \sum_{i=1}^n \| F_{\mathrm{net}}(\textbf{x}_i, \textbf{w}) - \textbf{y}_i \|^2. $$

In practice we usually have to approximate the optimal weights using a process like gradient descent.
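
Concretely, each gradient descent step nudges every weight in the direction that decreases the error, scaled by a learning rate $\eta$:

$$ \textbf{w} \leftarrow \textbf{w} - \eta \, \nabla_{\textbf{w}} E(\textbf{w}). $$

sklearn's MLPClassifier carries out this optimization for you, using solvers such as stochastic gradient descent, Adam, or L-BFGS.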

Classification using a multi-layer perceptron

In a classification task, we try to assign each input $\textbf{x}_i$ to one of $n$ different classes/categories. Essentially, we're trying to predict the correct label for the data point. Here's how an MLP is typically used to perform a classification task:

The data

For a classification problem with $n$ classes, the targets $\textbf{y}_i$ are usually one-hot encoded. That is, if sample $\textbf{x}_i$ belongs to the $k^{th}$ class, the corresponding $\textbf{y}_i$ is a zero vector except for a single entry of 1.0 in the $k^{th}$ position. $$ \textbf{y}_i = \left( 0.0, 0.0, \dots, 1.0, \dots, 0.0\right). $$ We interpret this $\textbf{y}_i$ as a probability distribution that assigns all of the probability to the true class.
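
As a quick illustration, here is one way to one-hot encode integer class labels with NumPy; sklearn.preprocessing.OneHotEncoder and LabelBinarizer offer the same functionality.

In [ ]:
import numpy as np

labels = np.array([0, 2, 1, 0])   # integer class labels for four samples, n = 3 classes
one_hot = np.eye(3)[labels]       # row i is the one-hot target y_i
print(one_hot)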

Converting outputs to probabilities

A standard MLP outputs a vector in $\mathbb{R}^n$. This output needs to be converted into a probability distribution over the $n$ classes. To do so we can apply the $\mathrm{softmax}$ function. Recall that the $\textrm{softmax}$ of a vector $\textbf{x} \in \mathbb{R}^n$ is

$$ \textrm{softmax}(\textbf{x}) = (\sigma(\textbf{x})_1, \dots, \sigma(\textbf{x})_n) = \left( \frac{e^{x_1}}{\sum_{j=1}^n e^{x_j}}, \dots, \frac{e^{x_n}}{\sum_{j=1}^n e^{x_j}} \right). $$

Since $\sum_{j=1}^n \sigma(F_\textrm{net}(\textbf{x}, \textbf{w}))_j = 1$, applying softmax turns the output of an MLP into a probability distribution.
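
A minimal NumPy implementation of softmax is shown below (scipy.special.softmax provides a ready-made version). Subtracting the maximum before exponentiating does not change the result but avoids numerical overflow.

In [ ]:
import numpy as np

def softmax(x):
    # Shift by the max for numerical stability; the output is unchanged.
    z = np.exp(x - np.max(x))
    return z / z.sum()

scores = np.array([2.0, 1.0, 0.1])   # e.g. raw MLP outputs
probs = softmax(scores)
print(probs, probs.sum())            # the probabilities sum to 1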

Choosing an appropriate error function

A common loss function for probability distributions is the cross-entropy loss. The cross-entropy between two distributions $\textbf{p} = (p_1, p_2, \dots, p_n)$ and $\textbf{q} = (q_1, q_2, \dots, q_n)$ is given by the formula $$ \textrm{cross-entropy}(\textbf{p}, \textbf{q}) = -\sum_{j=1}^n p_j \log(q_j). $$ Here we interpret $\textbf{p}$ as the "true" distribution and $\textbf{q}$ as the approximation we want to evaluate; lower cross-entropy scores are better. For our application, we measure the cross-entropy between the true distribution $\textbf{y}_i$ and the prediction $\mathrm{softmax}(F_\textrm{net}(\textbf{x}_i, \textbf{w}))$. The formula in this case is: $$ \textrm{cross-entropy}(\textbf{y}_i, \sigma(F_\text{net}(\textbf{x}_i, \textbf{w}))) = -\sum_{j=1}^n y_{i, j} \log \big(\sigma (F_\text{net}(\textbf{x}_i, \textbf{w}))_j\big). $$
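
Here is a minimal sketch of this computation, using scipy.special.softmax and illustrative numbers; the small epsilon inside the log guards against log(0).

In [ ]:
import numpy as np
from scipy.special import softmax

y_true = np.array([1.0, 0.0, 0.0])            # one-hot target y_i
y_pred = softmax(np.array([2.0, 1.0, 0.1]))   # softmax of the raw network output

loss = -np.sum(y_true * np.log(y_pred + 1e-12))   # cross-entropy of the prediction
print(loss)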

Goals (1)

The iris dataset consists of measurements of the sepal length, sepal width, petal length, and petal width of three different kinds of iris flowers. Our first goal is to build an MLP using sklearn to classify the irises. For visualization reasons, we will only use two of the input dimensions.

  • Load the sklearn iris dataset, selecting any two dimensions from the sample data for inputs into the neural network, and partition the data into 70% training and 30% validation sets. For example, to load the first two columns:

In [3]:
iris = datasets.load_iris()
xs = iris.data[:, 0:2]
ys = iris.target
  • Neural networks tend to perform better when the inputs are scaled to have zero mean and unit variance. Use sklearn.preprocessing.StandardScaler to appropriately scale the training and test sets. Note: While you have to transform both the training and test data, be sure to fit the scaler using the training data only.
  • Use sklearn.neural_network.MLPClassifier with cross-entropy error (the default and only choice) to train an MLP on the subset of the iris data you selected. Note that sklearn will automatically one-hot encode the iris classes for the training process. (A minimal sketch of the split / scale / fit workflow appears after this list.)
  • Make a scatter plot of your validation set. Colour code the points based on the type of iris they represent.
  • Make a filled contour plot (see matplotlib.pyplot.contourf) to visualize the decision boundaries of your classifier. (See below for an example filled contour plot). How do the decision boundaries compare to your scatter plot above?
  • Repeat the plots above for two different choices of activation function. For example, if you used sigmoid activations above, also plot the results using relu and tanh activations.
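
Below is a minimal sketch of the split / scale / fit workflow described above. It assumes the xs and ys loaded earlier; the hidden layer size and activation are illustrative choices, not tuned values.

In [ ]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

# 70% training / 30% validation split of the two selected iris features.
x_train, x_val, y_train, y_val = train_test_split(xs, ys, test_size=0.3, random_state=0)

# Fit the scaler on the training data only, then transform both sets.
scaler = StandardScaler().fit(x_train)
x_train_s = scaler.transform(x_train)
x_val_s = scaler.transform(x_val)

# Illustrative settings; hidden_layer_sizes and activation are yours to experiment with.
clf = MLPClassifier(hidden_layer_sizes=(10,), activation='logistic',
                    max_iter=2000, random_state=0)
clf.fit(x_train_s, y_train)
print(clf.score(x_val_s, y_val))   # accuracy on the validation set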

Example of a filled contour plot


In [4]:
X, y = datasets.make_classification(n_samples=250, n_features=2, n_informative=2, n_redundant=0, n_classes=3, n_clusters_per_class=1)
gmm = mixture.GaussianMixture(n_components=3).fit(X)  # unsupervised: the labels y are not used for fitting

xx, yy = np.meshgrid(np.arange(-5, 5, 0.2), np.arange(-5, 5, 0.2))
Z = gmm.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Accent, marker='.')
plt.contourf(xx, yy, Z, cmap=plt.cm.Accent, alpha=.6)
plt.show()


Goals (2)

There are a lot of parameters to tune in a neural network. The most important of these are the shape of the network and the type of activation function used. The easiest way to evaluate different combinations of parameters is simply to try every one; this approach is called 'grid search'. If you have enough computing power, grid search is a straightforward way to find the best combination within the grid you specify.

To help with fitting models for a grid search, sklearn provides the class sklearn.model_selection.GridSearchCV.

  • Use sklearn.model_selection.GridSearchCV to evaluate different combinations of hidden layer shapes and activation functions (as well as any other parameters you want to experiment with).
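
For example, a grid over a few network shapes and the three activation functions might look like the sketch below; the parameter values are illustrative, and x_train_s / y_train refer to the scaled training data from Goals (1).

In [ ]:
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Illustrative grid; add or change values freely.
param_grid = {
    'hidden_layer_sizes': [(5,), (10,), (10, 10)],
    'activation': ['logistic', 'tanh', 'relu'],
}

search = GridSearchCV(MLPClassifier(max_iter=2000), param_grid, cv=5)
search.fit(x_train_s, y_train)   # scaled training data from Goals (1)
print(search.best_params_)
print(search.best_score_)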

Goals (3)

If you have enough time, try fitting a support vector machine (with your choice of kernel function, see sklearn.svm.SVC) to classify the iris dataset. Compare the decision boundaries.
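
A minimal sketch, reusing the scaled data from Goals (1) and picking the RBF kernel as one possible choice:

In [ ]:
from sklearn.svm import SVC

svc = SVC(kernel='rbf')            # the kernel is your choice
svc.fit(x_train_s, y_train)
print(svc.score(x_val_s, y_val))   # compare against the MLP's validation accuracy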